Summary¶
In this notebook we analyze historical NFL game data and attempt to predict the probability that the next pass play will result in a sack. Sacks can be huge moments in NFL games: the momentum can shift entirely, or the sack can be the nail in the coffin. Hence there is strong motivation to predict the likelihood of a sack occurring, for the offense, the defense, and even for the viewing experience, which is the case in the context of Swish Analytics as they provide data for sports betting platforms.
The data provided includes play-by-play information for all games in the 2021-2023 seasons. Metadata was also provided covering team rosters for those same years, all NFL players of all time, depth charts, playing-time information, and advanced stats for defensive players, rushers, and passers.
Our analysis goes as follows:
- Assess the data
- Build several predictive models, and finally
- Compare and evaluate their performances against one another
Below you will find the exploratory data analysis including plots and graphs, feature engineering, and code assembled in one place.
Data Dictionaries¶
The following data dictionaries were crucial in helping with the analysis:
Depicted above is a nice, hard-hitting sack of Tom Brady.
Brainstorm¶
Initial Ideas¶
My initial ideas for data that would be helpful in predicting the likelihood of a sack:
- Historical number of sacks for each player on the defense
- You could also weight this by position, giving a higher weight to defensive linemen and a lower weight to strong safeties and corners (who are sometimes included in a blitz)
- Sacks allowed by the offensive linemen
- This would tell us whether the offensive linemen simply tend to allow more sacks, though we wouldn't expect this to be a major factor
- The number of times the particular quarterback in play has been sacked
- Down and distance
- A sack must occur on a passing play or intended passing play
- We would expect more passes on later downs, but given a certain score differential and time in the game, passes could become more likely on earlier downs
- Field position
- We aren't exactly sure here, but we do believe a sack would be unlikely behind a team's own 15-20 yard line
- Score of the game
- A larger differential would make the trailing team more likely to pass in desperate situations, which could lead to more sacks
- Timestamp of the game
- More desperation later in the game could result in more sacks, though we're not sure
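As a sketch of the first idea, a leakage-free "historical sacks" feature can be built with a shifted cumulative sum per player. The column names mirror the play-by-play data (`passer_player_name`, `sack`), but the rows below are invented for illustration:

```python
import pandas as pd

# Hypothetical mini play-by-play frame; column names mirror the real
# data, but the plays themselves are made up.
plays = pd.DataFrame({
    "passer_player_name": ["T.Heinicke", "T.Heinicke", "J.Hurts", "T.Heinicke"],
    "sack": [0, 1, 0, 1],
})

# For each pass play, count how many times this QB had already been
# sacked on *earlier* plays; the shift avoids leaking the current outcome.
plays["qb_prior_sacks"] = (
    plays.groupby("passer_player_name")["sack"]
    .transform(lambda s: s.shift(fill_value=0).cumsum())
)
print(plays["qb_prior_sacks"].tolist())  # [0, 0, 0, 1]
```

The same pattern would work for historical sacks recorded by each defender or allowed by each offensive line, grouped on the appropriate ID column.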
Mathematical Notation¶
For convenience let's establish a bit of notation around the probability distribution that a given passing play will result in a sack.
- Let $N$ be the total number of pass plays in our data set
- Let $S$ be the total number of sacks in our same dataset
- Let $X$ be our collection of data, including metadata about each play, with $\boldsymbol x_i$ being the $i^\text{th}$ row of $X$, which contains all of the information we have about a particular play.
- Let $\boldsymbol y$ be the associated outcomes of the plays in $X$, where $y_i = 1$ if the $i^\text{th}$ play resulted in a sack and $y_i = 0$ if it did not.
- Finally, let $P(y_i=1|\boldsymbol x_i)$ be the probability that the $i^\text{th}$ play is a sack given the data $\boldsymbol x_i$ we have about that play. Similarly $P(y_i=0|\boldsymbol x_i)$ represents the probability that a play does not end in a sack.
Models to Try:¶
- Purely based on historical information
- For example, the most naive model would say the probability that a given pass play will be a sack is simply the number of sacks divided by the total number of pass plays, denoted as $$ P(y_i=1|\boldsymbol x_i) = \frac S N \quad \text{and}\quad P(y_i=0|\boldsymbol x_i) = \frac {N - S} N.$$
- Then we could start to make it more complicated a little at a time
- For example, you could suppose that $\boldsymbol x_i$ contains information like "It is 3rd and 11 from the defensive team's 25 yard line with 3:00 left in the 4th quarter". Denoting the number of sacks in this scenario as $\sigma$ and the number of passing plays in this scenario as $\theta$, then $$ P(y_i=1|\boldsymbol x_i) = \frac \sigma \theta \quad \text{and}\quad P(y_i=0|\boldsymbol x_i) = \frac {\theta - \sigma} \theta.$$
- You could get even more granular about each scenario based on how much information is incorporated into the data $X$.
- Train a model
- Logistic Regression (great at determining probability distributions and binary classification)
- Random forest regressor
- Does this work when building a distribution?
- We don’t just want to predict whether it’s a sack or not with a certain accuracy. We want to determine the odds it will happen
- Can deep learning help?
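The historical baselines above can be sketched directly in pandas. The frame below is a toy stand-in with made-up downs and sack flags, not the real play-by-play data:

```python
import pandas as pd

# Toy stand-in for the real play-by-play data: 10 pass plays, 2 sacks.
passes = pd.DataFrame({
    "down": [1, 1, 2, 2, 3, 3, 3, 3, 4, 4],
    "sack": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Naive model: P(sack) = S / N over every pass play.
p_sack = passes["sack"].mean()
print(f"P(sack) = {p_sack:.2f}")  # 0.20

# Conditional refinement: sigma / theta within a scenario, here "3rd down".
third = passes[passes["down"] == 3]
p_sack_3rd = third["sack"].mean()
print(f"P(sack | 3rd down) = {p_sack_3rd:.2f}")  # 0.50
```

The more granular the scenario, the fewer plays $\theta$ fall into it, so the conditional estimates get noisier; that trade-off is exactly what the trained models below try to manage.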
Setup¶
from typing import Optional
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix,
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay,
    f1_score,
    recall_score,
    accuracy_score,
    precision_score,
)
from xgboost import XGBClassifier
import altair as alt
import matplotlib.pyplot as plt
alt.data_transformers.disable_max_rows()
alt.renderers.enable('default')
pd.set_option('display.max_columns', None)
Exploratory Data Analysis¶
Let's begin by exploring and getting familiar with the various datasets we have been provided with. For reference, see the previously included data dictionaries. For the sake of presenting a cleaner notebook, we have commented out all but a few especially informative lines where we print out the contents of the data.
# Look into the players.csv and see what's present
# players_df = pd.read_csv("data/players.csv", header=0, nrows=1000)
# players_df.head()
# depth_charts_2022_df = pd.read_csv("data/depth_charts_2022.csv", header=0)
# depth_charts_2022_df[(depth_charts_2022_df["club_code"] == "SEA") & (depth_charts_2022_df["formation"] == "Defense")].sort_values(by=["week", "position", "depth_team"]).head(5)
play_by_play_2022_df = pd.read_csv("data/play_by_play_2022.csv", header=0, low_memory=False)
# play_by_play_2022_df.sort_values(by=['week', 'play_id']).dropna(subset=["play_type"]).head()
number_of_nan_play_types_per_game = play_by_play_2022_df[play_by_play_2022_df.play_type.isna()].groupby(["game_id"], as_index=False).agg({"play_id": "count"})
number_of_nan_play_types_per_game.play_id.value_counts().sort_index(ascending=False)
play_id
6     29
5    255
Name: count, dtype: int64
My biggest takeaway at this point is that some plays have a null play_type, and this occurs at the beginning of a game, at the end of each quarter, and at the end of overtime periods. This was the key insight from this step.
Next, let's look more closely at a game I attended in Philly in 2022. After what I believe was an 8-0 start, the Eagles sadly suffered their first loss of the season that night to the Washington Commanders. Looking at this game's data will help me understand what information is actually recorded and how it is represented.
play_by_play_2022_df.where(play_by_play_2022_df.game_id == "2022_10_WAS_PHI").dropna(subset=["game_id"]).head()
(Output: a 5-row preview of the play-by-play rows for game `2022_10_WAS_PHI`. The frame has roughly 380 columns, including identifiers (`play_id`, `game_id`), situation fields (`posteam`, `defteam`, `yardline_100`, `qtr`, `down`, `ydstogo`, `score_differential`), the play description and type (`desc`, `play_type`), outcome flags such as `sack`, `qb_hit`, `pass_attempt`, and `fumble`, and player-attribution fields like `sack_player_name` and `half_sack_1_player_name`.)
eagles_loss_to_wash_df = play_by_play_2022_df[play_by_play_2022_df.game_id == "2022_10_WAS_PHI"]
for desc in eagles_loss_to_wash_df[eagles_loss_to_wash_df.drive == 1.0].desc:
    print(desc)
    print("===")
4-J.Elliott kicks 63 yards from PHI 35 to WAS 2. 24-A.Gibson to WAS 43 for 41 yards (21-A.Chachere, 17-N.Dean). PENALTY on WAS-88-A.Rogers, Offensive Holding, 8 yards, enforced at WAS 16.
===
(14:52) (Shotgun) 8-B.Robinson up the middle to WAS 11 for 3 yards (94-J.Sweat; 95-M.Tuipulotu).
===
(14:19) (Shotgun) 8-B.Robinson right guard to WAS 13 for 2 yards (95-M.Tuipulotu; 91-F.Cox).
===
(13:37) (Shotgun) 4-T.Heinicke pass incomplete deep left to 10-C.Samuel (33-J.Scott).
===
(13:32) 5-T.Way punts 47 yards to PHI 40, Center-54-C.Cheeseman. 18-B.Covey to PHI 48 for 8 yards (58-S.Toney; 39-J.Reaves). PENALTY on PHI-32-R.Blankenship, Roughing the Kicker, 15 yards, enforced at WAS 13 - No Play.
===
(13:20) (Shotgun) 4-T.Heinicke sacked at WAS 18 for -10 yards (94-J.Sweat). FUMBLES (94-J.Sweat) [94-J.Sweat], RECOVERED by PHI-95-M.Tuipulotu at WAS 18.
===
pass_plays_in_2022 = play_by_play_2022_df[
    (play_by_play_2022_df.play_type == "pass")
    # & (play_by_play_2022_df.sack == 1.0)
]
# pass_plays_in_2022.head()
# We ignore rows where down is NaN, which correspond to two-point conversion attempts
alt.Chart(pass_plays_in_2022[~pass_plays_in_2022.down.isna()][["yardline_100", "sack", "down"]]).mark_bar().encode(
    x=alt.X("yardline_100:Q", bin=alt.Bin(maxbins=20)),
    y=alt.Y("mean(sack):Q", title="Percentage of pass plays resulting in a sack").stack(False),
    color=alt.Color("down:N", title="Down"),
    column=alt.Column("down:N", title="Down"),
    opacity=alt.value(0.75),
).properties(
    title=alt.Title("Percentage of pass plays resulting in a sack at a given yard line (Only 2022)", fontSize=25)
)
This visual reveals some clear patterns. On first and second down there is consistently about a $5\%$ chance of a pass play being a sack, it averages closer to $10\%$ on $3^\text{rd}$ down, and $4^\text{th}$ down shows the most variance across the field, with peaks near each team's 20-30 yard lines.
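Numerically, the `mean(sack)` aggregation the chart encodes is just a group-wise average of the binary `sack` column. A minimal sketch on a hypothetical mini frame (the real notebook would use `play_by_play_2022_df`, which shares these column names):

```python
import pandas as pd

# Toy stand-in for the play-by-play data; values are illustrative only.
plays = pd.DataFrame({
    "down": [1, 1, 2, 3, 3, 3, 4],
    "sack": [0, 0, 0, 1, 0, 0, 0],
})

# Empirical sack rate per down, i.e. the same quantity the chart bins.
sack_rate_by_down = plays.groupby("down")["sack"].mean()
print(sack_rate_by_down)
```

The Altair spec additionally bins by `yardline_100`, but the underlying statistic is this per-group mean.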
Build Predictive Models¶
- Build a model based on these simple circumstantial features. Models to try include
- Logistic Regression
- Naive Bayes
- RandomForestClassifier
- XGBoost
- It is important to note that I will be training binary classification algorithms with the goal of being able to extract from the classifier the probability of a play resulting in a sack given the set of input features.
- Evaluate the model
- Expand on it with more complex input features about the team or players statistics
helpful_fields = ["yardline_100", "quarter_seconds_remaining", "qtr", "down", "ydstogo", "sack", "season"]
predictive_fields = ["yardline_100", "quarter_seconds_remaining", "qtr", "down", "ydstogo", "sack"]
- I'm going to move forward assuming I am given the knowledge that it is going to be a pass play. However, in the future it would be great to extend this model to try and make these predictions based on any play as it would be in reality.
- Choosing to rule out 2 point conversions for the current version of these models
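Since the plan is to read a sack probability off a binary classifier rather than a hard 0/1 label, here is a minimal sketch on synthetic data (the features and the roughly 6% positive rate are stand-ins, not the real play-by-play fields):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))             # stand-ins for circumstantial features
y = (rng.random(500) < 0.06).astype(int)  # ~6% positive rate, like sacks

clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns one column per class; column 1 is P(sack).
proba_sack = clf.predict_proba(X[:5])[:, 1]
print(proba_sack)
```

This is the mechanism the later models rely on: `.predict()` only thresholds these probabilities at 0.5.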
def load_all_season_passing_plays() -> pd.DataFrame:
"""
Load all passing plays from 2021 to 2023.
:return: DataFrame containing all passing plays.
"""
play_by_play_df = pd.DataFrame()
for year in range(2021, 2024):
cur_year_pbp_df = pd.read_csv(
f"data/play_by_play_{year}.csv",
header=0,
low_memory=False
)
play_by_play_df = pd.concat([play_by_play_df, cur_year_pbp_df], ignore_index=True)
# Passing plays, non 2 point conversions modeling choice, could be changed later
passing_plays_df = play_by_play_df[
(play_by_play_df.play_type == "pass")
& (~play_by_play_df.down.isna()) # My EDA revealed that the pass plays which have null down values are 2 point conversion attempts.
]
return passing_plays_df
passing_plays_df = load_all_season_passing_plays()
# passing_plays_df.head()
alt.Chart(passing_plays_df[helpful_fields]).mark_bar().encode(
x=alt.X("yardline_100:Q", bin=alt.Bin(maxbins=20)),
y=alt.Y("mean(sack):Q", title="Percentage of pass plays resulting in a sack").stack(False),
color=alt.Color("down:N", title="Down"),
column=alt.Column("down:N", title="Down"),
row=alt.Row("season:O", title="Season"),
# facet=alt.Facet("defteam:N", columns=8, title="Defensive Team"),
opacity=alt.value(0.75)
).properties(
title=alt.Title("Percentage of pass plays resulting in a sack at a given yard line in a given season", fontSize=25)
)
This is mostly a repeat of the previous chart, except now we can compare year to year. In conclusion, there do not appear to be any large, noteworthy trends across the 3 years of data we have.
alt.Chart(passing_plays_df[predictive_fields]).mark_bar().encode(
x=alt.X("yardline_100:Q", bin=alt.Bin(maxbins=20)),
y=alt.Y("mean(sack):Q", title="Percentage of pass plays resulting in a sack").stack(False),
color=alt.Color("down:N", title="Down"),
column=alt.Column("down:N", title="Down"),
# facet=alt.Facet("defteam:N", columns=8, title="Defensive Team"),
opacity=alt.value(0.75)
).properties(
title=alt.Title("Percentage of pass plays resulting in a sack at a given yard line (2021-2023)", fontSize=25)
)
Once again we look at a similar plot as before, only now aggregating across all 3 seasons of data at once.
Batch of feature engineering functions for prepping data and saving model experiment details¶
def prepare_data_for_training(
passing_plays_df: pd.DataFrame,
predictive_fields: list[str],
fields_to_encode: list[str],
do_standard_scale: bool = True,
label_field: str = "sack",
) -> pd.DataFrame:
"""
Prepare the passing plays DataFrame for training by selecting predictive fields,
encoding categorical fields, and optionally standard scaling the data.
:param passing_plays_df: DataFrame containing passing plays data.
:param predictive_fields: List of fields to use as predictors.
:param fields_to_encode: List of fields to encode using one-hot encoding.
:param do_standard_scale: Whether to standard scale the data.
:param label_field: The field to use as the label for training (default is "sack").
:return: Prepared DataFrame ready for training.
"""
if not set(fields_to_encode).issubset(set(predictive_fields)):
raise ValueError(
f"Fields to encode {fields_to_encode} must be a subset of predictive fields {predictive_fields}"
)
passing_plays_subset_df = passing_plays_df[predictive_fields]
passing_plays_subset_df = passing_plays_subset_df.astype({"down": int}) # hard coded for now, fix later
passing_plays_subset_df = pd.get_dummies(
passing_plays_subset_df,
columns=fields_to_encode,
dtype=int
)
if do_standard_scale:
temp_df = passing_plays_subset_df.copy()
temp_df = temp_df.drop(columns=[label_field])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(temp_df)
temp_df = pd.DataFrame(
scaled_data,
columns=temp_df.columns,
)
temp_df[label_field] = passing_plays_subset_df[label_field].values
passing_plays_subset_df = temp_df.copy()
return passing_plays_subset_df
def get_training_test_sets(
prepared_df: pd.DataFrame,
) -> tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
"""
Split the prepared DataFrame into training and test sets.
:param prepared_df: DataFrame prepared for training, containing features and labels.
:return: Tuple containing training features, training labels, test features, and test labels.
"""
x_train, x_test, y_train, y_test = train_test_split(
prepared_df.drop(columns=["sack"]),
prepared_df["sack"],
test_size=0.2,
random_state=42
)
return x_train, y_train, x_test, y_test
def build_model_record(
model_id: int,
model_name: str,
model: object,
x_test: pd.DataFrame,
y_test: pd.Series,
desc: Optional[str] = None,
standard_scaled: bool = False
) -> dict:
"""
Build a record of the model's performance metrics.
:param model_id: Unique identifier for the model.
:param model_name: Name of the model.
:param model: The trained model object.
:param x_test: Test features DataFrame.
:param y_test: Test labels Series.
:param desc: Optional description of the model.
:param standard_scaled: Whether the data was standard scaled.
:return: Dictionary containing model performance metrics.
"""
y_predict = model.predict(x_test)
accuracy_curr = accuracy_score(y_test, y_predict)
precision_curr = precision_score(y_test, y_predict, zero_division=0)
recall_curr = recall_score(y_test, y_predict, zero_division=0)
f1_curr = f1_score(y_test, y_predict, zero_division=0)
if hasattr(model, 'class_weight'):
class_weighting = bool(model.class_weight)
elif hasattr(model, 'scale_pos_weight'):
class_weighting = bool(model.scale_pos_weight)
else:
# Only GaussianNB should get here
class_weighting = True
model_record = {
"model_id": model_id,
"model": model_name,
"desc": desc,
"accuracy": accuracy_curr,
"precision": precision_curr,
"recall": recall_curr,
"f1_score": f1_curr,
"standard_scaled": standard_scaled,
"class_weighting": class_weighting,
}
return model_record
def record_model_results(
model_performance_df: pd.DataFrame,
model_name: str,
model: object,
x_test: pd.DataFrame,
y_test: pd.Series,
desc: Optional[str] = None,
standard_scaled: bool = False
) -> pd.DataFrame:
"""
Record the results of a model's performance and update the model performance DataFrame.
:param model_performance_df: DataFrame to store model performance records.
:param model_name: Name of the model.
:param model: The trained model object.
:param x_test: Test features DataFrame.
:param y_test: Test labels Series.
:param desc: Optional description of the model.
:param standard_scaled: Whether the data was standard scaled.
:return: Updated model performance DataFrame with the new model record.
"""
model_id = len(model_performance_df)
model_record = build_model_record(
model_id=model_id,
model_name=model_name,
model=model,
x_test=x_test,
y_test=y_test,
desc=desc,
standard_scaled=standard_scaled
)
new_rows = [model_record]
new_df = pd.DataFrame(new_rows)
model_performance_df = pd.concat([model_performance_df, new_df], ignore_index=True)
return model_performance_df
# passing_plays_df should already be loaded
# passing_plays_df = load_all_season_passing_plays()
prepared_df = prepare_data_for_training(
passing_plays_df=passing_plays_df,
predictive_fields=predictive_fields,
fields_to_encode=["down", "qtr"],
do_standard_scale=False,
)
prepared_df.head()
| | yardline_100 | quarter_seconds_remaining | ydstogo | sack | down_1 | down_2 | down_3 | down_4 | qtr_1 | qtr_2 | qtr_3 | qtr_4 | qtr_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 78.0 | 863.0 | 13 | 0.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 75.0 | 822.0 | 10 | 0.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 61.0 | 807.0 | 10 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8 | 31.0 | 746.0 | 18 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 9 | 30.0 | 714.0 | 17 | 0.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_df)
lr_model = LogisticRegression(
penalty="l2",
solver="lbfgs",
max_iter=1000,
random_state=42,
)
lr_model.fit(x_train, y_train)
LogisticRegression(max_iter=1000, random_state=42)
model_performance_df = pd.DataFrame()
model_performance_df = record_model_results(
model_performance_df=model_performance_df,
model_name="Base Logistic Regression",
model=lr_model,
x_test=x_test,
y_test=y_test,
desc=f"Logistic Regression model trained on {', '.join(lr_model.feature_names_in_)} features with no standard scaling.",
standard_scaled=False
)
model_performance_df
| | model_id | model | desc | accuracy | precision | recall | f1_score | standard_scaled | class_weighting |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Base Logistic Regression | Logistic Regression model trained on yardline_... | 0.936939 | 0.0 | 0.0 | 0.0 | False | False |
Initial Model Training Reflections (Class Imbalance)¶
After seeing the first model achieve an accuracy of $93.69\%$ one might be extremely excited; however, a closer look at its predictions reveals that the model figured out that always predicting the outcome will not be a sack yields such a high accuracy, without ever actually correctly predicting a sack when one occurs.
The term for this scenario is class imbalance: in our classification problem the cases where the outcome of interest (a sack) actually occurs are few and far between, so our dataset consists mostly of non-sack plays, with only about $6\%$ of plays resulting in a sack (unsurprisingly, the complement of our accuracy when predicting all non-sack outcomes).
We now get to come up with ways of mitigating this issue. The primary means we will attempt is providing information about the class imbalance to the models ahead of or during the training process. Secondly, we need metrics other than accuracy to evaluate the performance of the model. Common candidates include precision, recall, and the F1 score. Which of these best fits depends on the goals and use cases for the model, but in any case let's keep track of each of them so we can compare the models in a more nuanced way.
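As a concrete sketch of the class-weighting idea, scikit-learn's `compute_class_weight` can derive "balanced" weights directly from the label distribution; the roughly 6% positive rate below is illustrative, not the exact rate in the data:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with a ~6% positive (sack) rate, mimicking the imbalance.
y = np.array([0] * 94 + [1] * 6)

# "balanced" computes n_samples / (n_classes * class_count), so the
# rare class receives a proportionally larger weight.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y)
print(dict(zip([0, 1], weights)))
```

The resulting dictionary can be passed as the `class_weight` argument of `LogisticRegression` (or used to build per-sample weights for estimators that accept `sample_weight`).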
prepared_scaled_df = prepare_data_for_training(
passing_plays_df=passing_plays_df,
predictive_fields=predictive_fields,
fields_to_encode=["down", "qtr"],
do_standard_scale=True,
)
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_scaled_df)
lr_model_2 = LogisticRegression(
penalty="l2",
solver="lbfgs",
max_iter=1000,
random_state=42,
)
lr_model_2.fit(x_train, y_train)
model_performance_df = record_model_results(
model_performance_df=model_performance_df,
model_name="Base Logistic Regression",
model=lr_model_2,
x_test=x_test,
y_test=y_test,
desc=f"Logistic Regression model trained on {', '.join(lr_model_2.feature_names_in_)} features with a standard scaler applied.",
standard_scaled=True
)
class_weight = y_train.mean()
lr_model_3 = LogisticRegression(
penalty="l2",
solver="lbfgs",
max_iter=1000,
random_state=42,
class_weight={1.0: 1 - class_weight, 0.0: class_weight},
)
lr_model_3.fit(x_train, y_train)
model_performance_df = record_model_results(
model_performance_df=model_performance_df,
model_name="Base Logistic Regression",
model=lr_model_3,
x_test=x_test,
y_test=y_test,
desc=f"Logistic Regression model trained on {', '.join(lr_model_3.feature_names_in_)} features with a standard scaler and class weighting applied.",
standard_scaled=False
)
predictive_fields = ["yardline_100", "quarter_seconds_remaining", "qtr", "down", "ydstogo", "sack"]
extended_predictive_fields = predictive_fields + ["defteam", "posteam"]
prepared_extended_df = prepare_data_for_training(
passing_plays_df=passing_plays_df,
predictive_fields=extended_predictive_fields,
fields_to_encode=["down", "qtr", "defteam", "posteam"],
do_standard_scale=False,
)
prepared_extended_df.head()
| | yardline_100 | quarter_seconds_remaining | ydstogo | sack | down_1 | down_2 | down_3 | down_4 | qtr_1 | qtr_2 | qtr_3 | qtr_4 | qtr_5 | defteam_ARI | defteam_ATL | defteam_BAL | defteam_BUF | defteam_CAR | defteam_CHI | defteam_CIN | defteam_CLE | defteam_DAL | defteam_DEN | defteam_DET | defteam_GB | defteam_HOU | defteam_IND | defteam_JAX | defteam_KC | defteam_LA | defteam_LAC | defteam_LV | defteam_MIA | defteam_MIN | defteam_NE | defteam_NO | defteam_NYG | defteam_NYJ | defteam_PHI | defteam_PIT | defteam_SEA | defteam_SF | defteam_TB | defteam_TEN | defteam_WAS | posteam_ARI | posteam_ATL | posteam_BAL | posteam_BUF | posteam_CAR | posteam_CHI | posteam_CIN | posteam_CLE | posteam_DAL | posteam_DEN | posteam_DET | posteam_GB | posteam_HOU | posteam_IND | posteam_JAX | posteam_KC | posteam_LA | posteam_LAC | posteam_LV | posteam_MIA | posteam_MIN | posteam_NE | posteam_NO | posteam_NYG | posteam_NYJ | posteam_PHI | posteam_PIT | posteam_SEA | posteam_SF | posteam_TB | posteam_TEN | posteam_WAS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 78.0 | 863.0 | 13 | 0.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 75.0 | 822.0 | 10 | 0.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 61.0 | 807.0 | 10 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 31.0 | 746.0 | 18 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 30.0 | 714.0 | 17 | 0.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_extended_df)
lr_model_4 = LogisticRegression(
penalty="l2",
solver="lbfgs",
max_iter=1000,
random_state=42,
)
lr_model_4.fit(x_train, y_train)
model_performance_df = record_model_results(
model_performance_df=model_performance_df,
model_name="Extended Logistic Regression",
model=lr_model_4,
x_test=x_test,
y_test=y_test,
desc=f"Logistic Regression model trained on {', '.join(lr_model_4.feature_names_in_)} features with no standard scaling.",
standard_scaled=False
)
lr_model_5 = LogisticRegression(
penalty="l2",
solver="lbfgs",
max_iter=1000,
random_state=42,
class_weight={1.0: 1 - class_weight, 0.0: class_weight},
)
lr_model_5.fit(x_train, y_train)
model_performance_df = record_model_results(
model_performance_df=model_performance_df,
model_name="Extended Logistic Regression",
model=lr_model_5,
x_test=x_test,
y_test=y_test,
desc=f"Logistic Regression model trained on {', '.join(lr_model_5.feature_names_in_)} features with no standard scaling.",
standard_scaled=False
)
model_performance_df
| | model_id | model | desc | accuracy | precision | recall | f1_score | standard_scaled | class_weighting |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Base Logistic Regression | Logistic Regression model trained on yardline_... | 0.936939 | 0.000000 | 0.000000 | 0.000000 | False | False |
| 1 | 1 | Base Logistic Regression | Logistic Regression model trained on yardline_... | 0.936939 | 0.000000 | 0.000000 | 0.000000 | True | False |
| 2 | 2 | Base Logistic Regression | Logistic Regression model trained on yardline_... | 0.686717 | 0.092227 | 0.448718 | 0.153005 | False | True |
| 3 | 3 | Extended Logistic Regression | Logistic Regression model trained on yardline_... | 0.936939 | 0.000000 | 0.000000 | 0.000000 | False | False |
| 4 | 4 | Extended Logistic Regression | Logistic Regression model trained on yardline_... | 0.612661 | 0.089121 | 0.557692 | 0.153683 | False | True |
fig, ax = plt.subplots(figsize=(5, 5), dpi=160)
cm = confusion_matrix(y_test, lr_model_5.predict(x_test), labels=[0, 1])
ConfusionMatrixDisplay(cm).plot(colorbar=False, ax=ax)
plt.title("Confusion Matrix for Logistic\nRegression Model using class weighting", fontsize=16)
plt.show()
This cell confirms that logistic regression, as a classifier, predicts the class with the higher predicted probability.
(lr_model_5.predict_proba(x_test).argmax(axis=1) == lr_model_5.predict(x_test)).all()
np.True_
TP = 435
FN = 345
FP = 4446
TP / (TP + FN), TP / (TP + FP) # Recall and Precision respectively
(0.5576923076923077, 0.08912108174554395)
Describe why we want to maximize recall in our case (Sports betting context)¶
- Recall is more appropriate in our case because we care about identifying as many of the sacks as possible, knowing that in order to do so we will likely have more false positives, i.e., non-sack plays that we predict will result in a sack.
- As I understand it, in sports betting it behooves the sportsbook to claim that an outcome is more likely than it really is, which reduces the overall return for that bet and gives the sportsbook the margins they are looking for. Honestly, I am still learning about the industry, but my main takeaway is that we would rather catch as many sacks as we can, and we are okay if some plays we predict to be sacks end up not being sacks. Hence, we want to maximize recall, which is given by $$\text{recall} = \frac{TP}{TP + FN}$$
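One practical lever for trading precision for recall is lowering the decision threshold instead of relying on the default 0.5 cutoff. A minimal sketch on hypothetical probabilities (in the notebook these would come from something like `lr_model_5.predict_proba(x_test)[:, 1]` and `y_test`):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted sack probabilities and true labels.
proba = np.array([0.05, 0.40, 0.35, 0.55, 0.70, 0.10])
y_true = np.array([0, 0, 1, 1, 1, 0])

# Lowering the threshold flags more plays as sacks: recall rises,
# precision falls as extra false positives slip in.
for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"recall={recall_score(y_true, y_pred):.2f}, "
          f"precision={precision_score(y_true, y_pred):.2f}")
```

Sweeping this threshold is exactly what the precision-recall curves plotted below trace out.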
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_scaled_df)
gauss_naive_bayes = GaussianNB()
sample_weight = compute_sample_weight(
class_weight={
0: y_train.mean(),
1: 1 - y_train.mean()
},
y=y_train,
)
gauss_naive_bayes.fit(x_train, y_train, sample_weight=sample_weight)
# precision_score(y_test, gauss_naive_bayes.predict(x_test), labels=[0, 1], zero_division=0), recall_score(y_test, gauss_naive_bayes.predict(x_test), labels=[0, 1], zero_division=0)
model_performance_df = record_model_results(
model_performance_df=model_performance_df,
model_name="Naive Bayes",
model=gauss_naive_bayes,
x_test=x_test,
y_test=y_test,
desc=f"Naive Bayes model trained on {', '.join(gauss_naive_bayes.feature_names_in_)} features with a standard scaler applied.",
standard_scaled=True
)
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_extended_df)
gauss_naive_bayes = GaussianNB()
sample_weight = compute_sample_weight(
class_weight={
0: y_train.mean(),
1: 1 - y_train.mean()
},
y=y_train,
)
gauss_naive_bayes.fit(x_train, y_train, sample_weight=sample_weight)
# precision_score(y_test, gauss_naive_bayes.predict(x_test), labels=[0, 1], zero_division=0), recall_score(y_test, gauss_naive_bayes.predict(x_test), labels=[0, 1], zero_division=0)
model_performance_df = record_model_results(
model_performance_df=model_performance_df,
model_name="Extended Naive Bayes",
model=gauss_naive_bayes,
x_test=x_test,
y_test=y_test,
desc=f"Naive Bayes model trained on {', '.join(gauss_naive_bayes.feature_names_in_)} features with no standard scaling.",
standard_scaled=True
)
model_performance_df
| | model_id | model | desc | accuracy | precision | recall | f1_score | standard_scaled | class_weighting |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Base Logistic Regression | Logistic Regression model trained on yardline_... | 0.936939 | 0.000000 | 0.000000 | 0.000000 | False | False |
| 1 | 1 | Base Logistic Regression | Logistic Regression model trained on yardline_... | 0.936939 | 0.000000 | 0.000000 | 0.000000 | True | False |
| 2 | 2 | Base Logistic Regression | Logistic Regression model trained on yardline_... | 0.686717 | 0.092227 | 0.448718 | 0.153005 | False | True |
| 3 | 3 | Extended Logistic Regression | Logistic Regression model trained on yardline_... | 0.936939 | 0.000000 | 0.000000 | 0.000000 | False | False |
| 4 | 4 | Extended Logistic Regression | Logistic Regression model trained on yardline_... | 0.612661 | 0.089121 | 0.557692 | 0.153683 | False | True |
| 5 | 5 | Naive Bayes | Naive Bayes model trained on yardline_100, qua... | 0.688981 | 0.093560 | 0.452564 | 0.155063 | True | True |
| 6 | 6 | Extended Naive Bayes | Naive Bayes model trained on yardline_100, qua... | 0.552429 | 0.076114 | 0.547436 | 0.133646 | True | True |
fig, ax = plt.subplots(figsize=(5, 5), dpi=160)
display = PrecisionRecallDisplay.from_estimator(
gauss_naive_bayes, x_test, y_test, name="Gauss Naive Bayes", plot_chance_level=True, despine=True, ax=ax
)
display.ax_.set_xlabel("Recall")
_ = display.ax_.set_title("2-class Precision-Recall curve")
plt.legend(loc="upper right")
plt.show()
# precision, recall, thresholds = precision_recall_curve(y_test, lr_model.predict_proba(x_test)[:, 1], pos_label=1)
fig, ax = plt.subplots(figsize=(5, 5), dpi=160)
names = ["Logistic Reg", "Logistic Reg with Class Weighting", "Naive Bayes"]
for model, name in zip([lr_model_4, lr_model_5, gauss_naive_bayes], names):
display = PrecisionRecallDisplay.from_estimator(
model, x_test, y_test, name=name, plot_chance_level=True, despine=True, ax=ax
)
display.ax_.set_xlabel("Recall")
_ = display.ax_.set_title("2-class Precision-Recall curve")
plt.legend(loc="upper right")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curves for the Candidate Models")
plt.show()
Advanced Input Features¶
I want to incorporate information about the individual players who are likely on the field at the time: how many times the quarterback was sacked the previous year, and how many total sacks the team recorded the previous year. We could also be more specific about player-level sack totals and gauge the likelihood of a sack based in part on that information.
def load_def_advstats():
"""
Load defensive advanced statistics from the CSV files.
:return: DataFrame containing defensive advanced statistics.
"""
advstats_df = pd.DataFrame()
for year in range(2021, 2024):
cur_year_advs_df = pd.read_csv(
f"data/advstats_week_def_{year}.csv",
header=0,
)
advstats_df = pd.concat([advstats_df, cur_year_advs_df], ignore_index=True)
return advstats_df
def load_passer_advstats():
"""
Load passer advanced statistics from the CSV files.
:return: DataFrame containing passer advanced statistics.
"""
advstats_df = pd.DataFrame()
for year in range(2021, 2024):
cur_year_advs_df = pd.read_csv(
f"data/advstats_week_pass_{year}.csv",
header=0,
)
advstats_df = pd.concat([advstats_df, cur_year_advs_df], ignore_index=True)
return advstats_df
def estimate_unknown_season_sack_related_data(
df: pd.DataFrame,
fields: list[str],
team_type: str,
prev_season: int = 2020
) -> pd.DataFrame:
"""
Estimate the previous season's sack related data for a year whose data is not available.
:param df: DataFrame containing sack related data.
:param fields: List of fields to estimate.
:param team_type: Type of team to filter by (e.g., "posteam" or "defteam").
:param prev_season: The season to use for estimation (default is 2020).
:return: DataFrame with estimated previous season's sack related data.
"""
temp_df = df.copy()
temp_df.drop(columns=["prev_season"], inplace=True)
temp_df = temp_df.groupby(team_type, as_index=False).mean()
for field in fields:
temp_df[field] = temp_df[field].astype(int) # Cast to integers because they are counts
temp_df["prev_season"] = prev_season
return temp_df
def process_advstats(
advstats_df: pd.DataFrame,
team_type: str,
fields: list[str],
) -> pd.DataFrame:
"""
Process advanced statistics DataFrame by selecting relevant fields.
:param advstats_df: DataFrame containing passer or def advanced statistics.
:param team_type: Type of team to filter by (e.g., "posteam" or "defteam").
:param fields: List of fields to select from the DataFrame.
:return: Processed DataFrame with selected fields.
"""
passer_sack_agg_fields_with_prefix = {
f"prev_szn{('_' + team_type) if team_type == 'posteam' else ''}_{field}": (field, "sum")
for field in fields
}
advstats_df = advstats_df.astype({"season": int})
prev_season_advstats_df = advstats_df.groupby(
by=["team", "season"],
as_index=False
).agg(**passer_sack_agg_fields_with_prefix)
prev_season_advstats_df.rename(
columns={"team": team_type, "season": "prev_season"},
inplace=True,
)
estimated_2020_stats = estimate_unknown_season_sack_related_data(
prev_season_advstats_df,
prev_season_advstats_df.columns.difference([team_type, "prev_season"]),
team_type=team_type,
prev_season=2020
)
prev_season_advstats_df = pd.concat(
[prev_season_advstats_df, estimated_2020_stats],
ignore_index=True
)
return prev_season_advstats_df
def enrich_passing_plays_data_with_prev_szn_stats(
passing_plays_df: pd.DataFrame,
fields: list[str],
) -> pd.DataFrame:
"""
Prepare the passing plays DataFrame for training by merging it with advanced statistics
from the previous season. This includes relevant sack-related statistics for both
the offensive and defensive teams.
:param passing_plays_df: DataFrame containing passing plays data.
:param fields: List of fields to use as predictors.
:return: DataFrame ready for training with advanced statistics merged.
"""
advstats_passer_df = load_passer_advstats()
advstats_def_df = load_def_advstats()
passer_sack_relevant_fields = ["times_sacked", "times_blitzed", "times_hurried", "times_hit", "times_pressured"]
posteam_advstats_df = process_advstats(
advstats_df=advstats_passer_df,
team_type="posteam",
fields=passer_sack_relevant_fields,
)
def_sack_relevant_fields = ["def_times_blitzed", "def_times_hurried", "def_times_hitqb", "def_sacks", "def_pressures"]
defteam_advstats_df = process_advstats(
advstats_df=advstats_def_df,
team_type="defteam",
fields=def_sack_relevant_fields,
)
# Augment the passing plays DataFrame with minimal metadata
augmented_fields = fields + ["game_id"]
passing_plays_with_minimal_metadata = passing_plays_df[augmented_fields].copy()
# The previous season is the year prefix of the game_id minus one
passing_plays_with_minimal_metadata["prev_season"] = passing_plays_with_minimal_metadata["game_id"].str[:4].astype(int) - 1
passing_plays_with_minimal_metadata.drop(columns=["game_id"], inplace=True)
# Merge the passing plays DataFrame with the advanced statistics DataFrames
passing_plays_with_minimal_metadata = passing_plays_with_minimal_metadata.merge(
posteam_advstats_df,
how="left",
on=["posteam", "prev_season"],
)
passing_plays_with_minimal_metadata = passing_plays_with_minimal_metadata.merge(
defteam_advstats_df,
how="left",
on=["defteam", "prev_season"],
)
passing_plays_with_minimal_metadata.drop(columns=["prev_season"], inplace=True)
return passing_plays_with_minimal_metadata
passing_training_data_df = enrich_passing_plays_data_with_prev_szn_stats(
passing_plays_df=passing_plays_df,
fields=extended_predictive_fields,
)
passing_training_data_df.drop(columns=["posteam", "defteam"], inplace=True)
prepared_enriched_df = prepare_data_for_training(
passing_training_data_df,
passing_training_data_df.columns,
fields_to_encode=["down", "qtr"], # , "defteam", "posteam"],
do_standard_scale=True,
label_field="sack",
)
x_train, y_train, x_test, y_test = get_training_test_sets(prepared_enriched_df)
lr_model_6 = LogisticRegression(
penalty="l2",
solver="lbfgs",
max_iter=1000,
random_state=42,
class_weight={1.0: 1 - y_train.mean(), 0.0: y_train.mean()},
)
lr_model_6.fit(x_train, y_train)
model_performance_df = record_model_results(
model_performance_df=model_performance_df,
model_name="Enriched Logistic Regression with prev szn stats",
model=lr_model_6,
x_test=x_test,
y_test=y_test,
desc=f"Logistic Regression model trained on {', '.join(lr_model_6.feature_names_in_)} features with a standard scaler applied.",
standard_scaled=True
)
rf_model = RandomForestClassifier(
n_estimators=200,
max_depth=5,
random_state=42,
class_weight={1.0: 1 - y_train.mean(), 0.0: y_train.mean()},
)
rf_model.fit(x_train, y_train)
model_performance_df = record_model_results(
model_performance_df=model_performance_df,
model_name="Random Forest Classifier with prev szn stats",
model=rf_model,
x_test=x_test,
y_test=y_test,
desc=f"Random Forest Classifier model trained on {', '.join(rf_model.feature_names_in_)} features with a standard scaler applied.",
standard_scaled=True
)
xgb_model = XGBClassifier(
n_estimators=200,
max_depth=5,
random_state=42,
eval_metric="logloss",
scale_pos_weight=((len(y_train) - y_train.sum()) / y_train.sum())
)
xgb_model.fit(x_train, y_train)
model_performance_df = record_model_results(
model_performance_df=model_performance_df,
model_name="XGBoost Classifier with prev szn stats",
model=xgb_model,
x_test=x_test,
y_test=y_test,
desc=f"XGBoost Classifier model trained on {', '.join(xgb_model.feature_names_in_)} features with a standard scaler applied.",
standard_scaled=True
)
top_5_models = model_performance_df.sort_values(by="recall", ascending=False).head(5)
top_5_models
| | model_id | model | desc | accuracy | precision | recall | f1_score | standard_scaled | class_weighting |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 4 | Extended Logistic Regression | Logistic Regression model trained on yardline_... | 0.612661 | 0.089121 | 0.557692 | 0.153683 | False | True |
| 6 | 6 | Extended Naive Bayes | Naive Bayes model trained on yardline_100, qua... | 0.552429 | 0.076114 | 0.547436 | 0.133646 | True | True |
| 7 | 7 | Enriched Logistic Regression with prev szn stats | Logistic Regression model trained on yardline_... | 0.646940 | 0.090057 | 0.505128 | 0.152861 | True | True |
| 8 | 8 | Random Forest Classifier with prev szn stats | Random Forest Classifier model trained on yard... | 0.699733 | 0.097862 | 0.457692 | 0.161247 | True | True |
| 5 | 5 | Naive Bayes | Naive Bayes model trained on yardline_100, qua... | 0.688981 | 0.093560 | 0.452564 | 0.155063 | True | True |
def metric_chart(field: str, title: str) -> alt.Chart:
    """Bar chart of one metric for the top five models, one bar per model."""
    return alt.Chart(top_5_models).mark_bar().encode(
        x=alt.X("model:N", title="Model", axis=alt.Axis(labels=False, ticks=False)),
        y=alt.Y(f"{field}:Q", title=title, scale=alt.Scale(domain=[0, 1])),
        color=alt.Color("model:N", title="Model Description"),
    ).properties(width=150)

accuracy_chart = metric_chart("accuracy", "Accuracy")
precision_chart = metric_chart("precision", "Precision")
recall_chart = metric_chart("recall", "Recall")
f1_chart = metric_chart("f1_score", "F1 Score")

(accuracy_chart | precision_chart | recall_chart | f1_chart).properties(
    title=alt.Title(
        "The best model by our standard, recall, is not very remarkable by other metrics",
        fontSize=25,
        subtitle=[
            "The best model was trained on the first set of extended features which included",
            "posteam and defteam in addition to the basic set of circumstantial features.",
        ],
        subtitleFontSize=18,
    )
)
Conclusions and Future Work¶
Although class weighting did raise recall, it made the predicted probabilities unrealistic, at times claiming a sack is nearly a 50/50 proposition in a given scenario. That strikes me as a gross overestimate, an artifact of our attempt to counteract the imbalanced classes.
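One way to substantiate that overestimate is a reliability check: bin plays by predicted probability and compare each bin's average prediction to the observed sack rate. Below is a sketch using scikit-learn's `calibration_curve` on fabricated data (a real check would pass the model's `predict_proba` output and `y_test` instead):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in: sacks occur on ~7% of plays, but the "model" outputs
# inflated probabilities (up to ~55%), mimicking a class-weighted classifier.
y_true = (rng.random(n) < 0.07).astype(int)
p_pred = rng.random(n) * 0.5 + 0.05

prob_true, prob_pred = calibration_curve(y_true, p_pred, n_bins=5)

# A well-calibrated model has prob_true close to prob_pred in every bin; here
# the observed rate stays near 7% even in bins predicting 40%+.
for t, p in zip(prob_true, prob_pred):
    print(f"mean predicted {p:.2f} -> observed sack rate {t:.2f}")
```

A curve that hugs the diagonal would justify quoting the model's probabilities directly; a curve like this one says the scores should be treated as rankings, not likelihoods.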
In conclusion, I would advocate for a simpler model along the lines of those proposed in the initial brainstorming session: essentially an empirical estimate of the probability of a sack occurring, conditioned on a chosen set of input features. I believe this would yield a more faithful likelihood of the outcome, though by the nature of the problem it is hard to judge how good any probability estimate is, since there is no guaranteed ground truth to aim for. Such an empirical estimate is also very transparent and easy to explain to stakeholders: we can point directly at the historical frequencies that justify the probability we report.
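As a minimal sketch of that empirical approach (the toy frame below is fabricated for illustration; in practice `plays` would be the full play-by-play data and the conditioning columns whatever features we settle on):

```python
import pandas as pd

# Toy play-by-play data; in practice this would be the 2021-2023 dataset.
plays = pd.DataFrame({
    "down":     [1, 1, 2, 3, 3, 3, 3, 2, 1, 3],
    "distance": ["short", "long", "long", "long", "long",
                 "short", "long", "short", "long", "long"],
    "sack":     [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
})

# Empirical P(sack | down, distance): the observed sack rate in each bucket,
# with the bucket size kept alongside so we can see how trustworthy it is.
empirical = (
    plays.groupby(["down", "distance"])["sack"]
    .agg(p_sack="mean", n_plays="size")
    .reset_index()
)
print(empirical)
```

Buckets with small `n_plays` would need smoothing or a fallback to a coarser conditioning set before being quoted as probabilities.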
Other possible future work:
- Predict by flipping a biased coin for each play, where the coin comes up heads (a sack) with the probability output by a trained model like the ones above. See an example of this below.
- Incorporate player-level information; so far we have only used game-circumstance and team-level features.
# Stochastic predictions: flip a biased coin per play, with P(heads) equal to
# the model's predicted sack probability for that play.
probas = lr_model_6.predict_proba(x_test)[:, 1]
random_nums = np.random.rand(len(probas))
predictions = (probas > random_nums).astype(int)

print(f"recall: {recall_score(y_test, predictions, zero_division=0)}")
print(f"precision: {precision_score(y_test, predictions, zero_division=0)}")
fig, ax = plt.subplots(figsize=(5, 5), dpi=160)
cm = confusion_matrix(y_test, predictions, labels=[0, 1])
ConfusionMatrixDisplay(cm).plot(colorbar=False, ax=ax)
plt.title("Confusion Matrix for coin-flip predictions from the\nclass-weighted Logistic Regression model", fontsize=16)
plt.show()
recall: 0.5230769230769231
precision: 0.06912910877668586